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Abstract 

Symbolic execution is a technique which enables automatically generating test inpnts (and 
outputs) exercising a set of execution paths within a program to be tested. If the paths 
cover a sufficient part of the code under test, the test data offer a representative view of 
the program’s actual behaviour, which notably enables detecting errors and correcting 
faults. Relational databases are ubiquitous in software, but symbolic execution of pieces 
of code that manipulate them remains a non-trivial problem, particularly because of the 
complex structure of such databases and the complex behaviour of SQL statements. In 
this work, we define a direct symbolic execution for database manipulation code and 
integrate it with a more traditional symbolic execution of normal program code. The 
database tables are represented by relational symbols and the SQL statements by relational 
constraints over these symbols and the symbols representing the normal variables of the 
program. An algorithm based on these principles is presented for the symbolic execution 
of Java methods that implement business use cases by reading and writing in a relational 
database, the latter subject to data integrity constraints. The algorithm is integrated in 
a test generation tool and experimented over sample code. The target language for the 
constraints produced by the tool is the SMT-Lib standard and the used solver is Microsoft 
Z3. The results show that the proposed approach enables generating meaningful test 
data, including valid database content, in reasonable time. In particular, the Z3 solver is 
shown to be more scalable than the Alloy solver, used in our previous work, for solving 
relational constraints. 
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1. Introduction 


In current software development practice, testing mm remains the primary approach to 
detect errors in software. Testing being a complex and costly activity has motivated much 
research on efficient techniques to automate all aspects of the software testing process [3]. 
Of particular interest is the automation of test data generation ^ for functional testing 
of units of code, where the idea is to automatically generate a representative set of inputs 
(and outputs) for a unitary program fragment under test (typically a function or method). 
These data can subsequently be compared with the function’s expected behaviour in 
order to detect errors and correct faults. Moreover, once a suitable set of test data has 
been generated (and verified), it can be used as reference data for continued (regression) 
testing of the code. 

While different approaches exist towards the automatic generation of test data, symbolic 
execution [5] has been recognised as a promising technique for so-called white-box or 
glass-box testing lilTlIHlISlIin], where the idea is to generate test data that in some way 
cover a sufficiently large part of the control-flow graph of the function under test |1] . The 
symbolic execution process traverses the control-flow graph of a program or function by 
executing the code over symbolic values instead of concrete values [S] . Each time a control 
dependency is encountered, the symbolic execution process proceeds along one of the 
possible paths, thereby generating constraints upon the symbolic values such that when 
the program variables have values that satisfy these constraints, the real execution would 
proceed along the selected path. For each path, the constraints that have been collected 
along that path are regrouped in a so-called path-constraint which is subsequently solved, 
resulting in a set of concrete test values for the program variables that make the real 
execution proceed along the given path. The process terminates when test data have been 
generated for a sufficiently large (and diverse) number of paths through the control-flow 
graph, according to some coverage criterion [J]. Test data generation based on symbolic 
execution is now at the core of various popular open-source and commercial testing tools, 
some being used in an industrial context, notably at Microsoft and NASA [9]. 

t 

Symbolic execution of imperative code has been widely studied [m na [la E] , as 
well as generalised to lower-level or higher-level programming paradigms, such as for x86 
assembly m or Java Haim and C# [IH] object-oriented code. In so-called database 
programs, such classical code is mixed with SQL code to interact with a relational 
database. Enabling symbolic execution in the presence of SQL statements among the 
classical statements is a non-trivial extension of existing symbolic execution techniques. 

From a theoretical standpoint, deciding the satisfiability of an SQL query is not 
computationally possible in general |19j . so that generating test data for any particular 
execution path, in the presence of SQL statements in the path, is not generally computable 
either [30]. From a more practical standpoint, the database can be seen as nothing else 
than a particular kind of container for some of the values manipulated by the program. 
During symbolic execution, these values must thus be represented symbolically and 
subsequently constrained to enable the proper generation of test data, including an 
input and output database content. The symbolic representation of these values raises 
difficulties as the database is a container of particularly complex shape: its content must 
obey the so-called database schema, defined using SQL DDL code m- This database 
schema defines a set of tables where the database content will be stored. Each table can 
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be seen as a container for a mathematical relation, i.e., a set of tuples with no limit on 
the potential number of tuples. The schema typically also describes a set of data integrity 
constraints that must be enforced by the relations in the tables, like the primary key, 
foreign key or check constraints. These constraints are particularly complex as they are 
quantified first-order logic constraints. For instance, a primary key constraint states that 
for all couples of tuples in the relation represented by a table, the value of the primary 
key fields cannot be all equal. 

Moreover, a program typically interacts with the database by using SQL [2T] query 
statements and DML statements that are embedded in the program’s source code. As 
such, SELECT queries enable gathering values from the database tables in order to copy 
them into the program variables. DML statements INSERT, UPDATE and DELETE 
enable modifying the content of the database tables, typically a function of the value 
of the program variables. Symbolic execution of such SQL statements raises practical 
difficulties as well, due to their complex behaviour. Eirstly, SQL is a declarative language: 
SQL statements express the desired action over the content of the database relations, but 
they do not make explicit the (often complex) control-flow necessary to compute this 
action. In practice, during execution, these SQL statements are sent by the database 
program - using a dedicated API - to the DataBase Management System (DBMS) |21) . 
an external component responsible for the interpretation and execution of SQL code over 
the database. The DBMS keeps an optimised and persistent internal representation of the 
database and manages concurrent distant accesses by enabling the database programs to 
use transactions [5T]. Secondly, the execution of DML INSERT, UPDATE or DELETE 
statements by the DBMS does not only consist in modifying the content of the database, 
but also in checking that these modifications do not leave the database in a state where the 
integrity constraints defined in the schema are violated. If an integrity constraint violation 
is detected, the DML statement execution fails and the database remains unmodified. As 
a consequence, during symbolic execution each INSERT, UPDATE or DELETE statement 
will have to be treated as an if-then-else statement with a particularly complex condition: 

if {Program variables and Database are in a state 

where the SQL statement will not violate any constraint) { 

Execute the SQL statement! 

} else { 

Signal a constraint violation! 

} 


Two research approaches have been developed so far to handle the presence of SQL 
statements within the classical code to be symbolically executed. The first approach 
[221 [2S] introduces new native variables in the classical code to represent the database 
content and replaces the SQL queries and DML statements by native code acting on 
these new variables. Classical symbolic execution can then be applied on this normalised 
version of the database program. The second approach |24j considers the result of any 
query in the tested code as a new array-like program inputs which can be constrained 
within the path constraints. Solving any of these path-constraints enables generating 
input content to insert into the database, so that it will return adequate results for the 
queries along the considered path. 

In this work, we propose a different approach, called relational symbolic execution, which 
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enables a direct translation of both SQL queries and DML statements into constraints. 
Contrary to the normalisation approach, relational symbolic execution makes it possible 
to test the original code directly, without transforming it. Contrary to the input arrays 
approach, relational symbolic execution makes it possible to test programs which can also 
write data into the database, using DML statements. 

In practice, relational symbolic execution models every table in the database as a 
program variable typed as a mathematical relation. Each SQL statement in the program 
can then be modelled as a relational operation over these relational variables as well as 
the traditional program variables. By defining a symbolic execution over this relational 
version of the program, we can derive a set of path constraints over the values of both the 
program’s relational and traditional variables. The generated path constraints will thus 
include symbols representing classical program values as well as symbols representing the 
relations in the database tables. Furthermore, each path constraint must be combined 
with schema constraints over the relational symbols, in order to enforce data integrity of 
the database content. The result is a complex constraint system that mixes traditional 
constraints over the content of the program’s variables with relational constraints over 
the content of the relational variables. Each solution, instantiated to the combined 
constraint system, describes test data, including a valid initial and final state for each 
table in the database, such that when the program is executed with respect to these data 
values (including the database), the execution will follow the path represented by the 
constraint system. 

The main contribution of this work is a relational symbolic execution algorithm, which 
implements the technique described in the previous paragraph for a precise subset of Java, 
empowered by the JDBC API for using SQL primitives and transactions. The algorithm 
has been designed in the context of testing transactional software, where database programs 
can typically be divided into a set of rather small methods, implementing each a precise 
business use case, like registering a book borrowing or saving a sale transaction. Such an 
use case involves a short sequence of SQL actions touching a limited number of entities in 
the database, and must be executed in real-time. Given the Java/SQL code of a single 
method implementing a business use case, the SQL DDL code describing the schema of 
the database manipulated by this method and a finite path in the control flow graph of 
the method, the algorithm generates the corresponding constraints. 

A major challenge for relational symbolic execution is solving the mix of relational 
and classical constraints produced by the approach. In our previous work |251126j . these 
constraints were expressed using the Alloy language and solved using the Alloy 
analyser [27]. Recently, it has been proposed [28| to translate Alloy constraints into the 
SMT-Lib standard language |29|, as some solvers based on this language, like Microsoft Z3 
ESI, enable detecting the unsatisfiability of the constraints, which is not possible within 
the Alloy framework. However, |5B| also advises to use the Alloy analyser for finding 
solutions to satisfiable constraints, as Z3 had exhibited some limitations for this task, due 
to the unavoidable presence of quantifiers when translating relational constraints into 
SMT-Lib. Nevertheless, more recent versions of Z3 take advantage of new model-based 
quantiher instantiation capabilities (MBQI) |31j . providing the solver with a promising 
ability to find efficiently solutions for satisfiable quantified constraints. In this paper, 
the relational symbolic execution algorithm from our previous work has been redesigned 
to generate constraints in the SMT-Lib language and solved using an MBQI-powered 
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version of Z3. Z3 is at the core of several existing state-of-art symbolic execution tools 
(e.g. mus]). Moreover, it handles a much larger and more various panel of constraints 
than Alloy, the latter being limited to bounded-scope integer and relational constraints. 
Using Z3 as a back-end solver does strengthen thus the generalisability of our approach. 

A test generation tool based on our new relational symbolic execution algorithm and 
the Z3 solver has been coded and used to generate test data for a number of sample 
Java methods and databases, including open-source code extracted from the web. These 
experiments enabled a comparative evaluation of Alloy and Z3 for solving the generated 
constraints. They also made it possible to compare, to some extent, the performance 
of our approach with that of related work based on translation of SQL code into native code. 

The remainder of this paper is organised as follows. The relational symbolic execution 
algorithm is described in section After discussing the transactional context for which 
the algorithm was designed, we formally define the part of the Java/SQL syntax that is 
supported. Then, we systematically describe the constraint generation rules to be used 
for the symbolic execution of the supported language. The test generation tool based on 
the algorithm is described, together with our experimental evaluation of the approach, in 
section]^ Finally, some conclusions, related and future work are discussed in section]^ 
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2. A relational symbolic execution algorithm for transactional use cases 

2.1. Database programs in transactional software 

Interaction between software and a relational database is often said to occur either in an 
OLTP (On-Line Transaction Processing) or in an OLAP (On-Line Analytical Processing) 
context. In a nutshell, OLTP corresponds to the case where the database must be read 
and written to support human activities in real time, like managing concurrent bank 
transactions, updating customer files or processing ticket registrations. The OLAP context 
occurs when the data gathered in the database are patiently analysed, using data mining 
techniques, typically for reporting or strategical purposes, like accounting, budget or 
marketing. 

Our algorithm has been designed in the context of OLTP. The documentation of 
Microsoft’s DBMS SQL Server 201^ describes database programs designed in an OLTP 
context as ’’characterised by small, interactive transactions that generally require sub¬ 
second response times”. More precisely, a characteristic of the database programs written 
in an OLTP context is that they can typically be divided into rather small [32] methods 
’’implementing [business] use cases which execute a sequence of actions whereas each action 
usually reads or updates only a small set of tuples in the database” [33]. The purpose of 
our work is to generate test data for such methods. An example extracted from [83j of an 
use case implemented by such a method is a library user that wants to borrow a book. 
The code checks the user and book data and possibly allows and saves the borrowing. 

2.2. Defining a core language for relational symbolic execution 

Database programs can be written in many different programming languages, using 
different DBMS interfaces, and taking advantage of the numerous features of the various 
SQL dialects. This strong variability makes it necessary to define a framework to properly 
study our relational symbolic execution approach. Practically, we have chosen to define 
a formal subset of the language made of Java empowered by JDBC [33] and ISO SQL 
[33] . which are all very popular technologies. The defined sublanguage contains base 
SQL primitives as well as a Turing-complete subset of Java. The symbolic execution 
of many of the Java constructs not considered here has been studied elsewhere (see e.g. 
[ME]) and is a problem orthogonal to this work, which studies the integration of SQL 
handling into classic symbolic execution. The considered sublanguage allows integers as 
the only datatype for the values stored in the database. Non-linear integer arithmetic 
operations are not permitted by the sublanguage. The sole impact of allowing other 
datatypes, like bit-vectors, reals, timestamps or strings, as well as of allowing non-linear 
arithmetics, would be that our algorithm would produce bit-vector, real, timestamp, 
string and non-linear arithmetics constraints, in addition to the linear integer constraints 
produced here. Solving more heterogeneous constraint sets is evaluated in the discussion 
part of the paper. 

Concretely, our language describes OLTP business use cases. Such an use case is 
composed of the manipulated database’s DDL schema and of the code of the Java method 
implementing the use case. The database schema describes a set of tables composed of 
integer attributes. Each table has a primary key and the attributes can be constrained 


^http: //technet .microsoft. coin/eii-us/library/hh393556 (v=sql. 110). aspx 
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by foreign key constraints and check constraints. The code of the method can contain 
if-then-else blocks, while loops and return statements. It can contain integer and list 
(local) variable assignments, with typical operators for lists and linear operators for 
integers. Assertions can be declared in the code. The method interacts with the database 
through SQL base statements whose structure is entirely described statically, and through 
SQL base primitives for transaction management. Statements writing in the database 
can throw runtime exceptions if the write operation violates the DDL schema. Such 
exceptions can be caught. The method receives as input parameters a JDBC connection 
to the database, a set of integer lists and an input scanner for integers. The lists model 
any structured group of inputs transmitted to the code at method call. The scanner 
models the method’s access to simple data from the ’’outside world”, like user prompt, 
network access, etc. 

Figure provides a sample use case written in the language considered here. This 
sample describes a database with two tables: one for library shelves and one for the books 
stored in each of these shelves. The total number of books stored in a shelf is saved for 
each shelf. The Java method adds a set of new books to the database and updates the 
shelves’ books counts. If a book is added to a non-existent shelf, then the shelf is itself 
added to the database as well. The books are inserted one by one in isolated transactions. 
If a transaction was successful, the code of the added book is saved in a list, which is 
returned at the end of the method’s execution. 

2.3. A formal definition for our core language 

We define now precisely - using a BNF grammar -what subset of the Java/SQL syntax 
our algorithm can execute symbolically. This description is made with the implicit 
requirement that the written code is well-typed according the relevant standard Java and 
SQL typing rules. 

In the next paragraphs, the chosen notation for the BNF is standard but includes some 
additional meta-symbols: {...} (grouping), ^ (zero or one times), * (zero or more times) 
and ^ (one or more times). When a single nonterminal appears several times in a single 
production, subscript notation enables distinguishing between the occurrences. 

2.3.1. OLTP use case 

An OLTP use case is composed of the SQL DDL code of the manipulated database 
and of the code of the Java method implementing it. 

{oltp-use-case) ::= (sql-ddl) (Java-method) 

2.3.2. Database schema 

The relational database schema is a list of table definitions. This list can be empty, in 
which case the method is a classical method that works independently of any database. 
In the list, each table is identified by its name, contains at least one attribute and 
endorses exactly one primary key. Foreign keys and additional check constraints can be 
declared for a table. A row in a table cannot be deleted or see its primary key value 
modified as long as there exists at least another row in the database that references it (ON 
DELETE/UPDATE NO ACTION). The semantics of all the schema creation primitives 
conforms to the standard [3S] SQL DDL. 
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(sql-ddl) (table)* 

(table) ::= CREATE TABLE (id) ((att)* (p-key) (f-key)* (chk)*); 

(att) (id) INTEGER NOT NULL , 

(p-key) ::= CONSTRAINT (id),,, PRIMARY KEY ( (id),„ ) 

(f-key) .CONSTRAINT (id),„ FOREIGN KEY ( (id),,, ) REFERENCES (id),,,, ( (id)„f,,, ) 

(chk) ::= .CHECK ((id) {< | = | >} (integer)) 

(id) ■- {a |...| z I A |...| Z}{a |...| z I A |...| Z | 0 |...| 9}* 

(integer) ::= {1 | ... | 9}{0 | ... | 9}* | 0} 

2.3.3. Method signature and body 

We consider Java methods manipulating only internal variables and parameters. 
Variables can only be typed as ‘int’, ‘Java.util.List<Java. lang. Integer>’ or 
‘Java.sql.ResultSet’. The method receives as input parameters a connection to the 
database (typed as ‘Java.sql.Connection’), a scanner (typed as ‘Java.util.Scanner’) 
and some lists of integers (typed as ‘Java.util.List<Java.lELng.lnteger>’), where two 
distinct list parameters cannot reference a single list object. Its return type can be either 
‘void’, ‘int’ or ‘Java.util. List<Java. lang. lnteger>’. 

(.Java-method) ::= (type) (id) ((db-con),(inp) (parameters)) throws SQLException 
{ (stmt)* } 

(type) void 
int 

I List<Integer> 

(db-con) ::= Connection con 
(inp) Scanner in 

(parameters) ::= {. List<Integer> (id) }* 

The connection with the database is assumed reliable and every SQL statement, being 
well formed, is processed for its effect on the database. The semantics of all the Java 
constructs conforms to the classical Java specification and documentation. The semantics 
of all SQL statements conforms to the standard [3S] SQL specification. 

2 . 3 . 4 . Common statements and list management 

Common condition, loop, assertion and assignment statements, as well as com¬ 
mon integer expressions and boolean conditions can be used. Lists can be ma¬ 
nipulated with ‘add(int)’, ‘remove(int)’, ‘get(int)’ and ‘sizeO’ methods. The 
‘Java.util.ArrayList’ implementation of these methods is assumed to be used. A 
list variable can be ‘null’. 

(stmt) if ((cond)) {.(stmt),,,,,*} {else {.(stmt),,,,*}}^; 

I while ((cond)) -[ (stmt)* }; 

I assert (cond) ; 

I {int I List<Integer>}’ (id) = (expr); 

I (id).add( (int-expr) ); 

I (id) . remove ( (int-expr) ) ; 

I return (id); 
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Figure 1: SQL and Java/JDBC code of an use case adding books in a library database. 
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{cond) ::= true 
I false 
I (! {cond}) 

I {{cond)\ {& I 1} {cond) 2 ) 

I {{int-expr)i {< | == | >} {int-expr} 2 ) 

I ({id} == null) 

{expr} ::= {int-expr} \ {list-expr} 

{int-expr} ::= {id} 

I {integer} 

I ({int-expr}i {+ | -} {int-expr} 2 ) 

I ({id}, get ( {int-expr} )) 

I «id).size()) 

{list-expr} ::= {id} 

I null 

I new ArrayList<Integer>() 

Interacting with the outside world. The scanner parameter of the method can be used to 
get integer data from the ’’outside world” (user prompt, network access, reading from a 
hie, etc.). This interaction is assumed to always succeed, without any technical problem. 
We have thus the following new alternative for the {stmt} non-terminal: 

{stmt} ::= {int}’ {id} = in.nextIntO ; 

Reading data from the database. Data can be read from the database using simple SQL 
queries. The obtained ResultSet can be accessed using the ‘next () ’ and ‘getint (String) ’ 
methods. We have thus the following alternatives for existing terminals: 

{stmt} ::= { ResultSet }’ {id} = con. createStatement () . executeQuery (" {select-query} "); 

I (jd).next(); 

{int-expr} .getint(" <id>„„ ") 

{cond} ::= ({id} .next( {int-expr} )) 

as well as the following new terminals: 

{select-query} :;= SELECT {{id}^ ,}*{id}„ FROM { WHERE {db-cond} }■ 

{db-cond} ::= ({db-cond}i {AND j OR} {db-eond} 2 ) 

I (NOT {db-cond}) 

I ({id} {< I = I >} {db-int-expr}) 

{db-int-expr} ::= {id} 

I {integer} 

I ({db-int-expr}I {+ | -} {db-int-expr} 2 ) 

I "+( {int-expr} )+" 

Integer expressions appearing in SQL code {{db-int-expr}) can be parameterised by 
integer constants computed dynamically by the Java method. For example, in: 

void example (Connection con, Scanner in) throws SQLException { 
int n = 0; 

ResultSet r= con.createStatement().executeQuery(’’SELECT id FROM table WHERE id=”+(n)-|-””); 
r .next (); 

int p = r.getint(”id”); 

} 
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The charachter string that will be effectively sent to and processed by DBMS is: 
SELECT id FROM table WHERE id=0 

The parametric constants can depend on the method’s inputs, like in: 

void example (Connection con, Scanner in) throws SQLException { 
int n = in.nextint 0; 

ResultSet r= con.createStatement().executeQuery(’’SELECT id FROM table WHERE id=”+(n)+””); 

} 

Writing data into the database. Data can be written into the database using simple 
SQL INSERT, UPDATE or DELETE statements. If the execution of such a statement 
provokes a violation of one of the database schema integrity constraints, the database 
remains unmodified by the statement, an exception is thrown within the method and its 
execution is stopped. Such exceptions should be caught using a try/catch structure. 

(stmt) ::= con. createStatement (). execute (" {db-write) "); 

I try { con. createStatement 0 . execute (" (db-write) "); I 
catch (SQLException e) 

{ (stmt)* }; 

(db-write) INSERT INTO (id) VALUES ( { (int-expr),, }* (int-expr)„ ) 

I UPDATE (id),ah SET (id)^,, = (db-int-expr) { WHERE (db-cond) }’ 

DELETE FROM (id) { WHERE (db-cond) }’ 

Transaction management. SQL transactions are managed through the classical commit 
and rollback statements. We suppose that a new transaction is automatically started at 
the beginning of the method’s execution. The first call to commit or rollback will end this 
transaction and then starts a new one. Any subsequent call to commit or rollback will 
end the current transaction and start a new one. When a commit statement is executed, 
it makes permanent all the changes made to the database by the method since the current 
transaction was started. When a rollback statement is executed, it restores the database 
to its state at the start of the current transaction. We suppose that all the changes made 
to the database since the last transaction was started are automatically committed at the 
end of the method’s execution. 

(stmt) con. commit 0 ; 

I con.rollback0 ; 

2 . 4 . Relational symbolic execution algorithm 
2 . 4 . 1 . Inputs and outputs 

The proposed algorithm, which is described in this section, receives as inputs the SQL 
DDL code describing the schema of the database, the Java code of the method under test 
and an execution path through this method. It produces as output a constraint system 
mixing classical constraints with relational constraints. Solutions to this system are such 
that when the method is executed with respect to any of these solutions, its execution 
will follow the given path. 

Coupling this algorithm with any existing technique (e.g. [311 [HI 133 I3H1) able to 
explore a set of paths to test in the method’s control flow graph will enable generating test 
data for these paths. The set of paths for which test data are computed, as well as the 
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process used to explore these paths, are thus parameters of the method that we propose. 
This enables the method to be used within the context of different code coverage criteria [3] . 

The execution path received as input by our algorithm is supposed to be a finite path 
in the method’s control flow graph j4]. It defines which branches were taken at each 
of the encountered if statements, how many times the body of each encountered loop 
was executed (this number must be finite), which assertions were violated and, for each 
encountered try/catch statement, whether the catch clause was executed. 

Our algorithm translates the path into a constraint system, combining the path constraint 
with the database schema integrity constraints^ expressed in SMT-Lib |29| . a widely 
adopted language used as the standard language for many SMT solvers. The generated 
constraints fit into the SMT-Lib AUFLIA logic, i.e. they can involve quantified array, 
uninterpreted functions and linear integer arithmetic constraints. 

Solving the constraint system generated by the algorithm for a complete path enables 
finding values for both the inputs and outputs of the analysed Java method. The inputs 
include the content of each database table at the start of the method’s execution, the 
value of every list received as argument by the method, and a value for the part of the 
input stream that is consumed during the method’s execution. The outputs include the 
content of each database table at the end of the method’s execution, the final value of 
each of the argument lists of the method, and possibly the value returned by the return 
statement. If the constraint system produced for a given path has no solution, this means 
that the path is infeasible. As the produced constraints are written in a quantified logic 
that is not generally decidable [2S] , it can happen that for a given path the solver may 
neither be able to find a solution for the generated constraint system, nor be able to 
establish that such a solution does not exist. This is coherent with the problem being not 
computable in general. 


2.4-2. Algorithm principle 

The algorithm performs a symbolic execution of the path received as input. Each of the 
successive values taken by the method’s variables and by the database tables during the 
execution of the path is represented by corresponding symbols and defined by constraints. 

First, symbolic execution generates constraints over the symbols representing the initial 
values of the database tables. These constraints state that, initially, each table contains 
data that conform to the database schema integrity constraints. 

Then, symbolic execution analyses one by one the method’s statements in the order 
specified by the path. Each time a statement sets or changes the value of a method 
variable or database table, symbolic execution generates constraints over the symbols 
representing the new value. These constraints define how the new value can be computed 
from the values of the database tables and method variables before the statement’s 
execution. Moreover, every time the value of a database table is changed, constraints are 
also added to state that the new value satisfies the database schema integrity constraints. 

Einally, every time an if, while, assert or try/catch statement is encountered, symbolic 
execution generates an additional constraint over the symbols such that when the method 
is executed with respect to values satisfying this constraint, the execution is guaranteed 
to take the considered path. 
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2.4-3. Execution example 

In the following paragraphs, we illustrate the execution of the algorithm over the 
sample use case given in Figure (page |^. We detail each step of the symbolic execution 
process over the path where the while loop is executed once, the else branch of both the 
if statements is taken, and the catch clause of the try/catch is executed (lines 1-6, 8-15, 
18-21, 3 and 22). 

At each step, we present the rules used by our algorithm to generate the corresponding 
SMT-Lib symbols and constraints. It should be noted that SMT-Lib syntax is inspired 
by the S-expressions from Lisp, where classical expressions like 2 -H 2 -t- 2 are written 
as (+ 2 (+ 2 2)). All the rules that are part of the complete set of rules defining our 
symbolic execution algorithm for SimpleDB are presented during one of these steps, 
and/or described formally in a set of tables available at the end of this section. 

The first step executed by our algorithm is to generate SMT-Lib symbols and constraints 
for the SQL DDL code of the database schema. For the database schema described in 
Figure the generated SMT-Lib code is detailed in the frame below. For the reader’s 
convenience, the corresponding Java/SQL code will be reminded as a preliminary comment 
in the SMT-Lib code. 

First of all, the algorithm generates new symbol types for the kind of objects stored in 
each table defined by the schema (the lines prefixed by (0) in the SMT-Lib code below). 
It will then generate symbols and constraints describing the input content of each of 
these tables. The used modelling is inspired by the one proposed in [28] for relational 
types. First, a symbol is created (1) to represent the initial set of objects in each table. 
Typed as a boolean function, it returns true for each object present in the input content 
of the table. Symbols typed as integer functions are then generated (2) to associate to 
each object in the table one of its attribute values. Finally, constraints are generated to 
enforce on this input content all the check constraints (3), primary key constraints (4), 
and foreign key constraints (5) dehned in the schema. 

Note that the original SQL table and attribute names, as well as the original Java 
variable names, are used as SMT-Lib symbols, suffixed by the natural number 1 (e.g. 
book! or idl in the SMT-Lib code below), which represents the fact that the current 
symbols represent the initial values of the represented tables, attributes or variables. 
Subsequent values of a same table, attribute or variable will be represented by the same 
symbol suffixed with successive numbers. Moreover, as detecting corner cases linked to 
arithmetic overflow is not a priority of our work, we have used SMT-Lib symbols typed 
as mathematical integers to represent efficiently the fixed-width integers used the code. 

; CREATE TABLE book ( 

; code INTEGER NOT NULL, 

; shelfid INTEGER NOT NULL, 

; GONSTRAINT bPK PRIMARY KEY (code), 

; GONSTRAINT bFK FOREIGN KEY(shel£Id) REFERENGES shelf (id)); 

; CREATE TABLE shelf ( 

; id INTEGER NOT NULL, 

; numberOfBooks INTEGER NOT NULL, 

; GONSTRAINT sPK PRIMARY KEY (id), 

; GHECK(numberOfBooks > 0)); 
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; New types for tables 
(0) (declare-sort book) 

(0) (declare-sort shelf) 

; Input content of table Book 

(1) (declare-fun bookl (book) Bool) 

(2) (declare-fun shelfidl (book) Int) 

(2) (declare-fun codel (book) Int) 

(4) (assert (forall ((a book) (b book)) 

(=> (and (and (bookl a) (bookl b)) (= (codel a) (codel b))) (= a b)))) 

; Input content of table Shelf 

(1) (declare-fun shelfl (shelf) Bool) 

(2) (declare-fun numberOfBooksl (shelf) Int) 

(2) (declare-fun idl (shelf) Int) 

(3) (assert (forall ((a shelf)) (> (numberOfBooksl a) 0))) 

(4) (assert (forall ((a shelf) (b shelf)) 

(=> (and (and (shelfl a) (shelfl b)) (= (idl a) (idl b))) (= a b)))) 

; Foreign keys 

(5) (assert (forall ((a book)) 

(=> (bookl a) (exists ((b shelf)) (and (shelfl b) (= (shelfidl a) (idl b))))))) 

The second step executed by our algorithm is to define a new SMT-Lib symbol type 
(called BoundedList) for lists of integers. All the symbols that will be subsequently 
generated to represent the value of a variable typed as a Java list will be part of this new 
SMT-Lib type. A BoundedList symbol represents a record composed of three fields: the 
isNull field is typed as boolean, the size field is typed as integer and the elements field is 
typed as array of integers. If the isNull field is true, then the symbol represents the Java 
null value. Otherwise, the field size represents the size of the list, and the field elements 
represents an array whose indexes 0 to {size — 1) contain the elements of the list in the 
right order. 

(declare-datatypes () 

((BoundedList (mk-bounded-list (isNull Bool) (size Int) (elements (Array Int Int)))))) 

It should be noted that the BoundedList symbol type is defined using an algebraic 
datatype declaration, where ”mk-bounded-list” will be the ad-hoc constructor for list 
objects. If datatype declarations are handled by several SMT solvers, the related syntax 
has not been standardised in SMT-Lib. In this work, we use the datatype notation 
available in the Microsoft Z3 SMT solver [50] . 

The third step executed by our algorithm is to define symbols (typed as BoundedList) 
for the initial content of each list parameter of the method. For the example method 
considered in this section, the following code is generated: 

: INPUT PARAMETER: List<Integer> uewBooks 
(declare-const uewbooksl BoundedList) 

(assert (=> (not (isNull uewbooksl)) (>= (size uewbooksl) 0))) 

It should be noted that the second constraint enforces a general property of BoundedList 
objects, described earlier. However, it will be enforced one by one for each BoundedList 
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object. This is an optimisation compared to using a more general constraint, quantified 
over the set of all possible BoundedList objects. In order to solve such a constraint, the 
solver would indeed be required to instantiate correctly the quantifier by itself, making 
the constraint more costly to solve. 

The algorithm can then proceed with the symbolic execution of the method. It follows 
the path received as input and considers all the statements one by one. In the case of our 
example, the two first statements to be executed are assignments. Symbolic execution for 
assignment creates a new symbol of the correct type to represent the new value of the 
assigned variable (I) and generates constraints to specify that this new symbol contains 
the value computed by evaluating the expression on the right of the ‘=’ symbol (2). In 
the particular case where a list variable is assigned to a different list variable, the shared 
content of the two variables is represented by a single symbol. 

; int i = 0; 

(1) (declare-const if Int) 

(2) (assert (= if 0)) 


; List<Integer> addedBooks = new ArrayList<Integer>(); 

(1) (declare-const addedbooksl BoundedList) 

(2) (assert (not (isNull addedbooksl))) 

(2) (assert (= (size addedbooksl) 0)) 

The next statement in the path is a while statement. As the path specifies that the 
loop body must be executed, a constraint is generated to specify that the loop condition 
at this point of time should be true: 

; ENTERING LOOP: while ( !(newBooks == null) & (i < newBooks.size()) ) 

(assert (and (not (isNull newbooksl)) (< il (size newbooksl)))) 

Then the algorithm proceeds with symbolic execution of the statements in the loop 
body, as specified within the input path. The first statement is an assignment statement: 

; int error = 0; 

(declare-const errorl Int) 

(assert (= errorl 0)) 

Symbolic execution for use of the input scanner simply creates a new symbol to represent 
the scanned value: 

; int theShelf = in.nextlnt(); 

(declare-const theshelfl Int) 

Symbolic execution for select statements creates new symbols to represent the content 
of the ResultSet variable. A first symbol (1) describes the number of rows returned 
by the select query. These rows are available through a second symbol (2) which is a 
function that returns them in the order in which they are returned by the ResultSet: 
(shelveslList 0) will be the first returned row, (shelveslList 1) the second one and so on. 

; ResultSet shelves 

(1) (declare-const shelveslSize Int) 

(2) (declare-fun shelveslList (Int) shelf) 
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Constraints are then generated to specify that a row is part of the ResultSet if and only 
if it is part of the current content of the table on which the select query is executed and 
that it enforces the WHERE condition of the select query. In order to do so, the modelling 
proposed in [28j for constraining the content and cardinality of relations, in a way so that 
the constraints can be effectively solved by Z3, is used. Eollowing [28], new constraints 
are added (1) to define a function shelveslInvertedList which is the inverse of shelveslList. 
This function is used (1) to ensure that shelveslList defines a one to one correspondence 
between the integers 0 ^ ^ shelveslSize and the elements in the ResultSet. Helper code 
(2) is added to ensure an efficient pattern-based quantifier instantiation |39j by the solver, 
using the : pattern keyword m- 

; = con.createStatement().executeQuery(”SELECT id FROM shelf WHERE id=”-|-theShelf); 

(1) (declare-fun shelveslInvertedList (shelf) Int) 

(2) (declare-fun shelveslTrigger (Int) Bool) 

(1) (assert (and (>= shelveslSize 0) 

(=> (= shelveslSize 0) 

(forall ((c shelf)) (not (and (shelfl c) (= (idl c) theshelfl ))))))) 

(1) (assert (forall ((c shelf)) 

(=> (and (shelfl c) (= (idl c) theshelfl)) 

(and (>= (shelveslInvertedList c) 0) (< (shelveslInvertedList c) shelveslSize ))))) 

(1) (assert (forall ((c shelf)) 

(=> (and (shelfl c) (= (idl c) theshelfl)) 

(= c (shelveslList (shelveslInvertedList c )))))) 

(1) (assert (forall ((i Int)) 

(=> (and (>= i 0) (< i shelveslSize)) 

(= i (shelveslInvertedList (shelveslList i )))))) 

(1) (assert (forall ((i Int)) 

(! (=> (and (>= i 0) (< i shelveslSize)) 

(and (shelfl (shelveslList i)) (= (idl (shelveslList i)) theshelfl))) 

(2) :pattern (shelveslTrigger i)))) 

(2) (assert (=> (>= 0 shelveslSize) (shelveslTrigger 1))) 

(2) (assert (forall ((i Int)) 

(! (=> (and (>= i 0) (< i shelveslSize)) 

(shelveslTrigger (+ i 1))) 

: pattern (shelveslTrigger i)))) 

As the path specifies that the else branch of the if statement must be executed this 
time, a constraint is generated to specify that the condition of the if should be false, i.e. 
that shelves.next0 should return true. 

For each ResultSet object, the algorithm records the number of times the nextO 
method has been called on this object. This value represents the index increased by one 
of the row pointed by the cursor of the ResultSet at the current execution state of the 
path. When the boolean value returned by the ‘nextO’ method is used in an if or while 
condition, this value states if the number of rows in the ResultSet is greater or equal to 
the number of times the ‘nextO’ method has been called so far on this ResultSet. In 
this case, shelves.nextO will return true if the ResultSet shelves contains at least one 
row (as shelves.nextO has been called once on the ResultSet): 

; TAKING ELSE BRANCH OF: if ( ! shelves.next() ) 

(assert (>= shelveslSize 1)) 
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Symbolic execution for update creates a new symbol (1) typed as an integer function, 
that will replace the previous symbol associating the attribute value to each object in 
the table. As this new symbol is the second one to represent the value of the attribute 
numberOf Books, it is named numberOfBooks2. A couple of constraints (2) (3) is then 
generated to relate the old and new attribute values in the table: one for the rows that do 
not match the WHERE condition (2), and one for those that do (3). Finally, constraints 
are added to specify that no integrity constraint was violated during the update. In this 
case, a constraint (4) is added to state that the updated attribute values still enforce the 
check constraint defined in the database schema. 

; con.createStatement().execute( 

; ’’UPDATE shelf SET numberO£Books=nmnberOfBooks+l WHERE id =”+theShelf); 

(1) (declare-fun numberOfBooks2 (shelf) Int) 

(2) (assert (forall ((p shelf)) 

(=> (or (and (shelfl p) (not (= (idl p) theshelfl))) (not (shelfl p))) 

(= (nmnberOfBooks2 p) (numberOfBooksl p))))) 

(3) (assert (forall ((p shelf)) 

(=> (and (shelfl p) (= (idl p) theshelfl)) 

(= (numberOfBooks2 p) (+ (numberOfBooksl p) 1))))) 

(4) (assert (forall ((a shelf)) (> (numberOfBooks2 a) 0))) 

Subsequently, as the path specifies that the catch block of the try/catch statement 
must be executed, a constraint (1) is added to ensure that the method variables and 
the database are in a state where the INSERT execution will violate a schema integrity 
constraint. In this case, the constraint states that the inserted row has a similar primary 
key as the primary key of an existing row in the table or that the inserted row has a 
foreign key value that does not reference any existing row in the shelf table. Constraints 
are also automatically added to ensure that the ‘get(int)’ (2) method does not cause 
any runtime error. 

; TAKING THE CATCH BRANCH OF: 

; try { con.createStatement().execute( 

; ’’INSERT INTO book VALUES (”+newBooks.get(i)+”,”+theShelf+”)”); 

; } catch (SQLException e) { 

(1) (assert (or (exists ((a book)) (and (bookl a) 

(= (codel a) (select (elements newbooksl) il)))) 

(forall ((a shelf)) (=> (shelfl a) 

(not (= (idl a) theshelfl )))))) 

(2) (assert (not (isNull newbooksl))) 

(2) (assert (>= il 0)) 

(2) (assert (< il (size newbooksl))) 


The content of the catch block is then symbolically executed: 

; error = 1; 

(declare-const error2 Int) 

(assert (= error2 1)) 

As the path specifies that the else branch of the if statement must be executed this 
time, a constraint is generated to specify that the condition of the if should be false: 
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; TAKING ELSE BRANCH OF: if (error==0) 

(assert (not (= error2 0))) 

Symbolic execution for Rollback statements tells the algorithm to represent the current 
content of each database table using the symbols that were representing the content of 
the table just before the last start of a new transaction (saved by the algorithm at the 
beginning of the method execution and after each call to commit or abort). In this case, 
the database state is restored to its state at the method start, i.e. the algorithm rewinds 
the counters for the database symbols and symbols bookl, codel, shelfidl, shelf I, idl 
and numberOfBooksl represent the content of the database after the ‘con.rollbackO ’ 
statement. 

The assignment statement is then symbolically executed: 

; i = i + 1; 

(declare-const i2 Int) 

(assert (= i2 (+ il 1))) 

As the path specifies that the loop body must not be executed any more, a constraint 
is generated to specify that, at this point in time, the loop condition should be false: 

; ESCAPING LOOP: while ( !(newBooks == null) & (i < newBooks.size()) ) 

(assert (not (and (not (isNuII newbooksl)) (< i2 (size newbooksl))))) 

As a return statement is met, the symbolic execution can be stopped and the generated 
SMT-Lib constraint model can be returned. The Z3 solver can now be asked to find a 
valuation for the defined symbols satisfying the constraints. As the algorithm records 
what symbols represent the initial, respectively final, values of a variable or table, the 
input and output values of the method (for the considered path) can easily be extracted 
from the solution to the constraint system. 

For our example, the set of 29 constraints was solved by Z3 in 24ms (1.8 GHz Intel Core 
i5, 8GB Ram) and the test data that were obtained from the solution to the constraint 
system are summarised in the following tables: 


Inputs 

Name 


Symbol(s) 


Value 


Outputs 

Name 


Symbolic(s) 


Value 


TABLE 

shelf 


TABLE 

book 


CONTENT: 

shelf I, 

ATTRIBUTES: 

idl 

numberOf Books 1 

CONTENT: 

bookl 

ATTRIBUTES: 

codel 

shelf Idl 


id I n.Books 


6 


1 


12 


code 

s.Id 

4 

12 


TABLE 

shelf 


TABLE 

book 


CONTENT: 

shelf 1^ 

ATTRIBUTES: 

idl 

numberOfBooksl 

CONTENT: 

bookl 

ATTRIBUTES: 

codel 

shelf Idl 


id 

n.Books 

6 

1 

12 

1 


code 

s.Id 

4 

12 


newBooks 

in.nextlnt() 


newbooksl 

theshelfl 


[4] 

[ 6 ] 


newBooks 

addedBooksI 


newbooksl 

addedbooksl 


[4] 

[] 
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2.4-4- Constraint Generation Rules 

For sake of completeness, the following tables define the constraint generation rules 
used by our algorithm in case of an Insert (table , Update (table , Delete (table 
and Add/Remove (table statement. Table explains the abbreviations to be used in 
the other tables. 

In each of these tables, any declaration of a new symbol leverages a generator providing 
a fresh symbol identifier, i.e. which has still not been used in the SMT-Lib code generated 
so far. This is denoted by (for a function declaration): 

(declare-fun freshSym Type) 

and by (for a constant declaration): 

(declare-const freshSym Type) 

All the subsequent references to freshSym in the table represent this newly declared 
symbol. The generated fresh symbols are named according to the naming rule detailed 
along the example given in the previous subsection. 


Finally, it should be noted that assertions are handled as if statements. For example : 



Assertions are particularly useful for test data generation as they let the programmer 
express additional constraints otherwise non-obvious to the solver. 
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Table 1: SMT-Lib Translation Abbreviations List 


Abbreviation 

Meaning 

smtlO f(x) 

Java condition/expressioii x translated into a corresponding SMT-Lib 
condition/expression. 

smt20f(x, t, r) 

SQL condition/expression x evaluated for row r in table t translated 
into a corresponding SMT-Lib condition/expression. 

name(x) 

if (x refers to a database table name) then 

The symbol that represents the current content of table x 
else if (x refers to a database attribute name) 

The symbol that represents the current values of attribute x 
else if (x refers to a Java variable name) 

The symbol that represents the current content of the Java variable x 

att; 

Name of the ith attribute in the list of attributes of table {id) 

pk 

Name of the primary key attribute of table {id). 


Position of primary key in the list of attributes of table {id) 


Name of the table referenced by the i^ foreign key in the list of foreign 
keys of table {id) 

Ikf 

Name of the primary key attribute of the table referenced by the ith 
foreign key in the list of foreign keys of table {id) 

i 

Position of the foreign key attribute, declared by the ith foreign key in 
the list of table {id), in the list of attributes of table {id) 

~W^ 

1 

Name of the table where is declared the i,/, foreign key referencing table 
{id) in the whole schema 

ifkf' 

Name of the foreign key attribute declared by the ith foreign key 
referencing table {id) in the schema 

nos 

cof 

Position of the attribute constrained by the ith check constraint declared 
in table {id) 


Inverted right part of the i,/, check constraint declared in table {id) (i.e. 
inverted right part of ”a > 0” is ”< 0”) 
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Table 2: SMT-Lib constraints generation rules for INSERT statements 
INSERT INTO (id) VALUES (.(int-expr)i , ... , {int-expr)^ , ... , {int-expr}„) 

if (no exception thrown in path for this INSERT) { 

; Inserted primary key value does not already exist 

(assert(forall((a(td)))(=>(name((trf)) a)(not(= (name(pk) a) smt20f({int-expr)p^^po.'!)))))) 
; Inserted values constrained by the i'* foreign key reference existing rows 
(assert(exists((a fk('’*))(and(= (name(fk|’*) a) smt20f({int-expr)^^po^)) (name(fk“*) a)))) 

; Symbol for new table content 
(declare-fun freshSym {{id)) Bool) 

; Constraints describing new table content 

(assert (forall ((a (id))) (=> (name((id)) a) (freshSym a)))) 

(assert (exists ((a (id))) (and (= (att, a) smt20f({int-expr)i)) (freshSym a)))) 

(assert (forall ((a (id))) 

(=>(and(not (name((id)) a))(not (= (att,- a) smt20f({int-expr)i)))){not (freshSym a))))) 
; No duplicate inserted row 
(assert (forall ((a {id}) (b {id})) 

(=>(and(and (freshSym a) (freshSym b)) (= (pk a) (pk b))) (= a b)))) 

} else { 

// Logical disjunction between every possible constraint 
// violation given the database schema and this insert: 

; The inserted primary key value already exists in the table 

(exists ((a {id))) (and (name((tff}) a) (^ (name(pk) a) sint20f{{int-expr)^^p<>^ )))) 

; inserted foreign key value does not reference a row: 

(forall ((a fk|‘'*))(=> (name(fk“*) a) (not (= (nameffkj’*^) a) smt20f{{int-expr)^^pps))))) 

; An inserted attribute violates the i'* check constraint: 

(not (co"®*' smt20f{{int-expr)(-Qpi>!))) 

} 
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Table 3: SMT-Lib constraints generation for UPDATE statements 
UPDATE {id) SET {id)an = (db-int-expr) WHERE (db-cond) 
if (no exception thrown in path for this UPDATE) { 

; Symbol for new attribute values 
(declare-fun freshSym {(id)) Int) 

; Constraints describing new attribute values 

(assert (forall ((a (id))) (=> (or (and (name((id)) a) (not smtlOf((db-cond),(id),&))) 
(not (name((id)) a))) (= (freshSym a) (name((tci)„„) a))))) 

(assert (forall ((a (id))) {—> (and (name((id)) a) smtlOf((db-cond),(id),af) 

(= (freshSym a) smtlOf((db-int-expr),(id),a))))) 

; Update on attribute constrained by foreign key not leave pending references 
(assert (forall ((a (id))) (=> (name((trf)) a) 

(exists ((b fk“*)) (and (name(fk“*) b) (= (freshSym a) (nameffkC*) b ))))))) 

; Update on attribute constrained by primary key not leaves duplicate attribute values 
(assert (forall ((a (id)) (b (id))) 

(=> (and (and(name((id)) a) (name((irf)) b)) (= (freshSym a) (freshSym b))) (= a b)))) 
; Update on primary key referenced by foreign key does not leave pending references 
(assert (forall ((a ifk“*)) (=> (namefifkf*) a) 

(exists ((b (trf)))(and (name((jrf)) b) (— (name(ifkf') a) (freshSym b))))))) 

; Update on attribute constrained by check constraint does not violate the constraint 
(assert (forall ((a (id))) (co"**' (freshSym a)))) 

} else { 

// Logical disjunction between every possible constraint 
II violation given the database schema and this update: 

; Update on primary key leads to duplicate attribute values 

(exists ((a (id)) (b (id))) (and(and (name((irf)) a) (and (name((id)) b) (not (= a b)))) 
(or (and smtlOf ((db-cond),(id),a) (and smtlOf ((db-cond),(id) ,h) 

(= smtlOf ((db-int-expr),(id),a) smtlOf ((db-int-expr),(id),b}))) 

(and (not smtlOf ((db-cond),(id),a)) (and smtlOf ((db-cond),(id) ,h) 

(= (name«td)„„) a) smtlOf ((db-int-expr),(id),b]))))) ) 

; Update on primary key referenced by the i’’' foreign key leaves pending references 
(exists ((a (id)) (b (id))) (and (and (name((jd)) a) (name(ifk“*) b)) 

(and (and (not (= (name((frf)^„) a) smtlOf ((db-int-expr),(id),a))) 

{= {r\ame((id)„„) a) (namefifkC*^) b))) smtlOf ((db-cond),(id),a)))) 

; Update on attribute constrained by foreign key leaves pending references 
(exists ((a (id))) (and (and (name((id)) a) smtlOf ((db-cond),(id),a)) 

(not (exists ((b name(fk“*’))) {— (name(fk|’*) b) smtlOf ((db-int-expr),(id),a)))))) 

; Update on attribute constrained by check constraint violates the constraint 
(exists ((a (id))) (and (and (name((*d)) a) smtlOf ((db-cond),(id),a)) 

(not (co"**‘ smtlOf ((db-int-expr),(id),a))))) 

} 
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Table 4: SMT-Lib constraints generation for DELETE statements 
DELETE FROM (id) WHERE (db-cond) 
if (no exception thrown in path for this DELETE) { 

; Symbol for new table content 
(declare-fun freshSym {(id)) Bool) 

; Constraints describing new table content 
(assert (forall ((a (id))) 

(= (freshSym a) (and (name((id)) a) (not smt20f({db-cond),{id),a)))))) 

; Delete does not leave pending references for foreign key 
(assert(forall ((a fkf*) (b {id})) 

(=> (and (name((trf)) b) (and (not (freshSym b)) (namefifk'"*) a))) 

(not (= (name(pk) b) (name(ifkf') a)))))) 

} else { 

// Logical disjunction between every possible constraint 
II violation given the database schema and this update: 

; Delete leaves pending references for foreign key 
(exists ((a fk“*) (b {id})) 

(and (and (and (name((id)) b) (namefifk)"*) a) ) smt20f{{db-cond},{id},h)) 
(= (name(pk) b) (name(ifk“") a)))) 

} 


Table 5: SMT-Lib constraints generation rules for add(int) and remove(int) statements 
(irf>.add( {int-expr} ); 

(declare-const freshSym BoundedList) 

(assert (not (isNull name((id))))) 

(assert (not (isNull freshSym))) 

(assert (= (size freshSym) (-1- (size name((td))) 1))) 

(assert (= (elements freshSym) 

(store (elements name((td))) (size name((id))) smt20f({int-expr}}))) 


(id).remove( {int-expr} ); 

(declare-const freshSym BoundedList) 

(assert (not (isNull name((id))))) 

(assert (not (isNull freshSym))) 

(assert (>= (size name((id))) 1)) 

(assert (= (size freshSym) (- (size ”-|-oldVar-|-”) 1))) 

(assert (>= smt20f {{int-expr}) 0)) 

(assert (< smt20f {{int-expr}) (size name((tci))))) 

(assert (forall ((i Int)) (=> (and (>= i 0) (< i smt20f {{int-expr}))) 

{= (select (elements name((*d))) i) (select (elements freshSym) i ))))) 
(assert (forall ((i Int)) (=> (and (>= i smt20f {{int-expr})) (< i (size freshSym))) 
(= (select (elements name((*d))) (+ i 1)) (select (elements freshSym) i ))))) 
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3. Experimental evaluation 

3.1. Symbolic execution and path exploration 

Originally introduced in [5], symbolic execution has been used as the core principle 
of many test data generation techniques. In some of these techniques (see e.g. [4T] for 
an overview), symbolic execution is performed for a finite set of finite paths that are 
statically explored in the control-flow graph, i.e. independent of any concrete input values, 
in accordance with a given coverage criterion [3]. By solving the constraints associated to 
each of the paths, one obtains a set of inputs and outputs that satisfy the given coverage 
criterion. 

More recently, test data generation techniques that combine symbolic execution with a 
dynamic path exploration process have also been proposed (see e.g. [nilSZlIMl)- The 
point of using a dynamic path exploration process is to provide seamlessly symbolic 
execution with concrete values to replace the statements for which constraints cannot 
be generated (like proprietary API calls) or handled by the solver (as they belong to 
a complex, exotic and/or undecidable logic). This process is called concretisation and 
is deeply discussed and evaluated in [32] • In practice, the program is first executed 
on concrete inputs to produce concrete outputs, but the code is instrumented so that 
symbolic execution is performed together with this normal run of the program, thereby 
generating the constraint system corresponding to the concrete execution. By flipping 
some of the constraints among those generated by this symbolic execution, one may 
produce constraints whose solution describes new concrete inputs triggering the execution 
of a different path. The process is then repeated with these new inputs until a set of 
inputs and outputs has been generated for a set of paths covering a sufficient part of the 
code, again according to some coverage criterion |4]. 

The principle of relational symbolic execution is orthogonal to the path exploration 
process, and can be combined with both a static or dynamic approach. However, in order 
to evaluate the ability and efficiency of our technique, we have built a prototype tool 
that integrates an implementation of the relational symbolic execution presented in the 
previous section with a static path exploration process. 

3.2. Experimental framework 

Our tool works as follows. Given a method to test, the tool simply performs a depth-first 
search of the control-flow graph, considering all the paths that execute the body of each 
loop in the method at most K times, where /T is a parameter of the tool. Consequently, 
our prototype satisfies a finitely applicable variant of the common path-coverage criterion 
|4|, similar to the loop count-K criterion originally proposed in [48] . 

In parallel to this exploration of the control-flow graph, the tool applies the relational 
symbolic execution algorithm proposed in section and solves the produced constraints 
using the Z3 solver. Input and output data (including database content) are extracted 
from the solution to the produced path constraints and recorded as constituting test-cases. 
Once all the paths have been explored, a so-called test-suite comprising all the generated 
test-cases is returned to the user. The tool also keeps a separate list of the paths for 
which the solver heuristics fail solving the constraints. 

From a technical standpoint, the tool was coded in Java 1.6 and run on a dual core 
Intel Core i5 processor at 1.8GHz (256 KB L2 cache per core and 3 MB L3 cache) with 
8GB of dual channel DDR3 memory at 1600 MHz. The runtime environment was the 
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Oracle JVM 1.6.0_45 under a 32-bit edition of Microsoft Windows 8.1. The version 4.3.0 
of the Microsoft Z3 solver was used. 

The sample code used for our experimental evaluation is composed of eighteen Java 
methods, performing SQL operations over relational databases. These methods can be 
divided into four groups: 

• The first group contains nine methods. Each of them was crafted to systematically 
evaluate the correct symbolic execution of one of the different Java and SQL constructs 
handled by our algorithm. As such, the methods in this group exercise the different 
behaviours of the integer and list operators, conditional and loop statements, SQL 
statements and transaction management primitives. 

• The second group contains three methods crafted as typical implementations of 
OLTP use cases. The first method in this group performs repeated manipulations 
of integers and lists to compare and save some data. The second method performs 
many interleaved reads and writes in a database containing four tables, representing 
regular or prospect customers that make purchases of products. The third method 
mixes SQL statements with traditional Java code and uses SQL transactions. The 
manipulated database contains two tables that represent authors writing theatre 
plays. 

• The third and fourth groups contain Java methods extracted respectively from 
UnixUsag^and Riskllj^ the two open-source software that have been used in [25] . 
as a basis for evaluating the proposed test generation approach, based on SQL 
normalisation into native code. UnixUsage is a monitoring application for Unix, 
manipulating a database with eight tables and thirty one attributes. Riskit is an 
insurance application, manipulating a database with thirteen tables and fifty-seven 
attributes. 

Together, the methods from our testbed constitute a set of five hundred lines of code, 
containing notably eighty SQL statements (including SELECT, INSERT, UPDATE, 
DELETE statements, as well as transaction management code), over databases containing 
up to thirteen tables (subject to primary key, foreign key and check constraints). 

Detailed statistics for each of these methods can be found in table including the value 
of K used in our tool to limit the loop exploration depth for each method. Given this 
value of K, the number of paths to be explored is provided for each method, detailing the 
number of feasible and infeasible ones. The high number of infeasible paths is principally 
due to the methods in the second group (and, to a lesser extent, the methods for testing 
conditional constructs from the first group), methods that were particularly constructed 
in this way for assessing the soundness of our test generation tool. 

Finally, the code of these methods, as well as the generated test data, can be found on 
the wet0 


http: //sourceforge .net/projects/se549uiiixusage 
https: //riskitinsurance. svn. sourceforge.net 
"https://staff.info.unamur.be/mmr/scp/ 
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3.3. Evaluating the Alloy and Z3 solvers for relational symbolic execution 
3.3.1. Performance comparison over common samples 

Our first objective with the experiments was to compare the ability of the Alloy and Z3 
solvers to solve effectively and efficiently the constraints produced by relational symbolic 
execution. In order to do so, we used the Alloy version of our tool from [26] and the 
SMT-Lib version presented here to generate test cases for the methods of the first and 
second groups from our testbed. 

The Alloy solver basically solves the constraints by setting upper bounds on the scope 
of the possible integers, as well as on the cardinality of the possible different rows which 
can appear in the solution for each different table. This makes finite the set of possible 
solutions, which can then be exhaustively explored, by transforming the constraints into 
an equisatisfiable SAT instance, solved by a dedicated SAT solver. As a consequence, if 
no solution is found with the default minimal bound value, our tool should repeat the 
process recursively with a higher value as bound, until a time-out is reached. When the 
time-out is reached, the tool considers the underlying path as unfeasible. The time-out 
value should thus be long enough to enable finding a proper model for feasible paths. 
At the same time, it should not be too long to detect infeasible paths in minimal time. 
Finding an optimal time-out value appeared to be very difficult in practice, as it depends 
on the size and complexity of the constraints to solve. 

In the left part of table [7j we synthesise the results obtained with the Alloy version 
of the tool, over the feasible paths of the methods from groups one and two. For each 
method, the table provides the minimal bound values enabling Alloy to find inputs for 
all the feasible paths in the method. It also provides the total number of constraints 
generated for these feasible paths and the minimal time within which the solver was 
able to solve these constraints. The methods named ’’Integers”, ’’Update”, ’’Delete” and 
’’Clients and Products” involve either large integer values or repeated actions on a single 
table. As a consequence, the constraints generated for these methods require large 
enough bound values to be solved, which had in practice to be found manually, using a 
costly trial and error approach. Moreover, the results show that once the bounds are 
increased, the size of the SAT instance generated by the Alloy solver quickly explodes, 
while the solving time blows up from dozens of milliseconds to several minutes. Finding 
inputs for the ’’Handling Data” and the ’’Play Catalog” methods, with their thousands of 
paths, revealed to be intractable in reasonable time, with the Alloy solver and without 
any manual help. 

In accordance with |55], where Z3 is used as a complement to the Alloy solver for 
proving instances unsatisfiable, the Z3 solver was able to detect almost instantaneously 
the unsatisfiability of the generated constraints. As a consequence, the SMT-Lib version 
of the tool was able to detect properly and efficiently the thousands of unfeasible paths in 
the methods from groups one and two. As we used a more recent version of Z3 than in 
|28j . involving notably a new model-based quantifier instantiation technique for solving 
quantified constraints, we could hope that the solver would also be able to find efficiently 
satisfying models for the generated constraints and thus produce inputs for satisfiable 
paths. The obtained results are detailed in the right part of table [Tj both for feasible and 
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Table 6: Statistics for the selected samples 


Number of Paths 

Infeasible 

CRAFTED SAMPLES 

- Language Unit Testing 

O 

O 

319 

1 

CO 

63 


15 

1 


- Realistic OLTP Methods 

3513 

32 

5569 

SAMPLES EXTRACTED FROM OPEN-SOURCE SOFTWARE 

- UnixUsage 

O 

o 

o 

O 

- Riskit 

T—1 

CM 

181 

10046 

Feasible 

t-H 

1 

t-H 

1 

t-H 

T—1 

t-H 

1 

CM 

23 


63 

CM 

CM 

CM 

CM 

CO 

CM 

(CO 

1-H 

1-H 

K Depth 

CO 

CO 

t-H 

1 

CO 

CO 

CO 

CO 

(CO 

iO 

lO 

CM 

T— I 

1-H 


1 

CM 

CM 

up to 5 

Number 
of tables 

o 

o 

O 

o 


(M 

CM 

CM 

CM 

1-H 


CM 

00 

00 

00 

00 

CO 

1 

13 

up to 13 

Number of 
SQL statements 

o 

o 

o 

o 



t-H 

t-H 

CO 

CM 

1-H 

00 
T— I 


T— I 

1-H 

T—i 

1 


t-H 

o 

00 

LOCs 



30 

30 

23 

1 

20 


13 

45 

56 

72 


10 

o 

T— I 

1 

1 

1 

55 

60S 

Name 

Integers 

Lists 

If and While 

Conditions 

Select 

Insert 

Update 

Delete 

Transactions 

Handling Data 

Clients and Products 

Play Catalog 

courseN ameExists 

getCommandsByCategory 

getCourselDByName 

getDepartmentID ByN ame 

createNewUser 

deleteUsers 

ALL METHODS 
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infeasible paths. The Z3 solver was able to find inputs for each feasible path in milliseconds, 
always outperforming the Alloy solver, even when minimal bound values were used in the 
Alloy version. As an example, the Alloy version took more than thirty eight minutes and 
a half to hnd test inputs for the four feasible paths of method ’’Clients and Products”, 
when the Z3 version was able to do it in three seconds and a half. Finding inputs for 
the ’’Handling Data” and the ’’Play Catalog” methods, which was intractable using Alloy, 
became possible in less than two minutes using Z3. 


3.3.2. Practical scalability limits 

Our second objective with the experiments was to estimate the practical scalability 
limits of current solver technology for the generated constraints. In order to do so, we 
have measured the number of generated constraints and the constraint solving time for 
relational symbolic execution (using Z3) of the following method, containing a single 
execution path, traversing a linear block of SQL statements. These statements have 
been selected to provide a balanced mix of typical queries and DML statements. DML 
statements act on fields subject to integrity constraints. 

CREATE TABLE tl ( 

idtl INTEGER NOT NULL, 

fieldtl INTEGER NOT NULL, 

CONSTRAINT tlPK PRIMARY KEY (idtl), 

CHECK(idtl > 0)); 

CREATE TABLE t2 ( 

idt2 INTEGER NOT NULL, 

fieldt2 INTEGER NOT NULL, 

CONSTRAINT t2PK PRIMARY KEY (idt2), 

CONSTRAINT t2FK FOREIGN KEY (fieldt2) REFERENCES tl(idtl), 

CHECK(idt2 > 0)); 


void test (Connection con,Scanner in) throws SQLException { 
int i = 1; 

con.createStatement().execute(”INSERT INTO tl VALUES (”+i+”,”+i+”)”); 
con.createStatementQ.executeC’INSERT INTO tl VALUES (”+(i+l)+”,”+(i+l)+”)”); 
con.createStatementQ.execute^INSERT INTO t2 VALUES ^+i+”,”+i+”)”); 
con.createStatementQ.executeC’INSERT INTO t2 VALUES (”+(i+l)+”,”+(i+l)+”)”); 
int inputl = in.nextint(); 

ResnltSet resultl = con.createStatement().executeQnery(”SELECT idtl 

FROM tl 

WHERE fieldtl=”+i); 

resultl .next (); 

con.createStatement().execute(”DELETE FROM t2 WHERE idt2=”+inpntl); 
con.createStatenient().execute(”UPDATE t2 

SET fieldt2 = l+fieldt2 

WHERE idt2 < ”+(2+resultl.getInt(”idtl”))); 

inputl = in.nextint 0; 

con.createStatement().execute(”DELETE FROM tl WHERE idtl=”+inputl); } 


Then, the measurement was repeated for a similar path, but where the block of SQL 
statements was executed twice in turn; a first time normally and a second time with a 
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value of i increased by 2. The statements of the second round were executed directly 
on the database resulting from the hrst round so that they can modify the rows it 
contains as well. The process was then repeated with four rounds and so on. The 
obtained measurements are reported in figure and These graphs show the number of 
constraints and the constraint solving time, as a function of the number of SQL statements 
executed in the path. The number of constraints evolves linearly with the number of SQL 



Figure 2: Constraint number as a function of the number of SQL statements in the path 
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Figure 3: Solving time as a function of the number of SQL statements in the path 


statements. The constraint solving time increases much more quickly with the number 
of SQL statements. The solving time starts to rise around 60 statements and becomes 
important above 100 statements, on our Intel Core i5 at 1.8GHz. 
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5 .^. Benchmarking relational symbolic execution against normalisation 

Our last objective with the experiments was to compare the performance of relational 
symbolic execution with the approach proposed in [23]. In this last approach, the tested 
code is preprocessed to translate the SQL code into native program code, before applying 
a classical symbolic execution process. In order to perform this comparison, we have 
selected four simple Java methods from the testbed used by [53j and used our tool on 
them. These methods are those of the third group from our testbed. 

Our tool was asked to generate tests with the same statement coverage as offered by 
|23j . In order to propose a fair comparison, the total generation time for each method 
was measured on a Pentium 4 configuration (3.06GHz, IGB RAM, Microsoft Windows 
XP 32-bit), similar (but admittedly not identical) to the one used in [23]. The obtained 
results are synthesised in the first part of table Results for the normalisation-based 
tool are extracted from |23|. While the results are of course not fully conclusive as both 
tests were run on different (but comparable) machines, we can nevertheless observe our 
tool to be between one and two orders of magnitude faster than [23] . generating in each 
case tests in a few seconds compared to more than one minute and a half. 

A first reason for such a performance difference between both tools is that |23| normalises 
the SQL code into native code before trying to generate tests. This requires time and 
strongly increases the number of paths to be explored, as each SQL statement is translated 
into native code that introduces new branches and cycles in the control-flow graph. 
Secondly, |53| is built upon the Microsoft Pex tool for constraint-based testing, which 
makes use of a dynamic path exploration, requiring to effectively run the code for each 
generated test case. Our research prototype tool makes use of a static path exploration and 
does not need to run the code, which also has an impact on our performance comparison. 
However, this impact is limited by fact that the methods are very small, involving between 
7 and II simple statements, with only one simple SQL (SELEGT) statement, one loop 
and no conditional per method. Finally, it should be noted that our tool handles the 
string attributes appearing in the methods as integer ones. This is possible as the tested 
methods do not use string-specific operators. 

3.5. Additional experiments over open-source code 

UnixUsage and Riskit provide a useful source of open-source code for evaluating our 
testing tool. As a consequence, we have also extracted some methods involving not only 
SELECT statements, as in |53|, but also SQL statements writing into the database, in 
order to assess our tool. These methods are those of the fourth group from out testbed. 

Test data generation for these methods required extending our symbolic execution 
algorithm to handle some new parts of Java and SQL, used in Riskit, or to simulate 
them the currently handled sublanguage. In particular, the management of tables with no 
primary key or with multiple-attribute primary key was integrated in the algorithm and 
string management operations were simulated using either integers or lists of integers. As 
the second part of tablej^shows, correct test data were generated in just a few seconds and, 
moreover, manual reviewing of the automatically generated test data enabled detecting a 
possible fault in the code of Riskit, where a runtime error is thrown when the method 
createNewUser is called on inputs where the inserted job does not reference any existing 
occupation or industry. 
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4. Discussion 


4-1- Synthesis of research contributions 

In this work, we proposed an approach for enabling the direct symbolic execution of SQL 
code into constraints. This is a non-trivial extension to traditional symbolic execution 
because of the complex structure of relational databases and the complex behaviour of 
SQL statements. Given a database program mixing traditional code with SQL statements, 
each database table manipulated by the program is modelled as a variable typed as a 
relation and each SQL statement as a relational operation over both these relational 
variables and the traditional variables of the program. A classical symbolic execution 
process can then be applied to produce sets of mixed relational and classical constraints 
over symbols representing the values of both the classical and relational variables of the 
program. The resulting path constraints can be unified with the data integrity constraints 
from the database schema. Any solution to the resulting constraint system for a path 
describes input and output data for the program, including a valid database content, with 
respect to which the program can be executed and is guaranteed to follow the same path 
for which the constraints were generated. 

A symbolic execution algorithm based on this principle has been completely detailed 
for a precise subset of Java and SQL. This language enables writing Java methods that 
use SQL statements and transactions to read and write in a relational database; the latter 
typically subject to data integrity constraints. The algorithm has been designed to test 
rather small methods, acting on a limited number of tuples in the database, as they are 
typically written in programs acting in an OLTP context, to implement single business 
use cases. Given the schema of the database, the code of the method and an execution 
path in this method, the algorithm performs the symbolic execution of the path and 
produces the corresponding constraints in the standard SMT-Lib language. 

The algorithm has been used in a prototype tool to generate test data for a number of 
methods, including some open-source code. The generated constraints have been solved 
using the Microsoft Z3 solver. The experiments show that the technique is able to generate 
test data for all the considered methods, seamlessly and in reasonable time. In particular, 
the results dramatically improve the scope of the approach, compared to the strategy 
based on the Alloy solver, proposed in our previous work [25l[26]. These results provide 
both an experimental confirmation and new elements to the research presented in |28j , 
where Z3 was used to prove the unsatisfiability of Alloy constraints. Our experiments can 
indeed be seen as some kind of a case study for [28) . which confirms the conclusions of 
[S5] about proving unsatisfiable instances. Moreover, our results show that versions of Z3 
more recent than the one used in [28) . including new model-based quantifier instantiation 
techniques, can outperform the Alloy solver also in model finding for satisfiable instances. 
Finally, our measurements showed that the approach may face scalability issues outside 
the context of methods implementing OLTP use cases, if it is used as such over pieces of 
code whose typical execution scenarios involve the processing of a large number of SQL 
statements. 
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Jf..2. Related work 

An early work that has considered test data generation for programs interacting with 
a relational database is |22]- The paper suggests to transform the program by inserting 
new classical variables representing the database structure, and translating all SQL 
statements into native program code acting on these variables. A translation to C++ 
is provided for some relational operators and for some other SQL mechanisms, like row 
sorting. Classical white-box testing approaches can then be applied to the normalised 
program. A conceptually similar but entirely automated technique is proposed in [23], 
where the off-the-shelf Microsoft Pex tool, based on symbolic execution and a dynamic 
path exploration process, is applied to the normalised version of database programs 
written in C#. This latter approach is validated over 39 Java methods (translated to 
C# using an automated compiler) extracted from UnixUsage and Riskit. These samples 
involve 32 LOCs per method on average, with a maximum of 108 LOCs, and each method 
mixes Java statements with one or a few SELECT SQL statements. 

Conceptually, normalising SQL code into native code and then applying classical 
symbolic execution on the result is an alternate approach to ours, where the SQL code 
is directly compiled into relational constraints during symbolic execution. Replacing a 
single SQL statement by a piece of native code, simulating its execution by a DBMS, is 
time-consuming and may strongly increase the number of paths to be explored, compared 
to the original code [55]. In contrast, in relational symbolic execution, the code must 
not be preprocessed and the execution paths to be considered are limited to paths in 
the original program code. Our experimental benchmarks, comparing the tool from 
|23j . based on normalisation, with our tool, tend to indicate that a direct translation 
of SQL into constraints provokes a strong performance improvement. Moreover, even if 
the symbolic execution of INSERT, UPDATE and DELETE statements is conceptually 
possible using normalisation, |23| only validates their approach over code containing 
SELECT statements. Relational symbolic execution has on the contrary be experimented 
in the presence of SQL code writing into the database, as well as in the presence of SQL 
transactions. 

On the other hand, the tool proposed in |53| provides important features that are 
not handled by ours. Eirst, it deals with more complex SELECT queries. Secondly, it 
relies on Microsoft Pex, as an off-the-shelf back-end constraint-based testing tool. Eirst, 
Pex handles natively a large part of the C# language, where our tool is restricted to a 
small part of Java. Secondly, Pex provides some support for character strings, which 
constitute an important datatype in SQL, not handled by our tool. Thirdly, Pex uses 
a dynamic path exploration process (coupled with heuristics for a smarter covering of 
large control-flow graphs), making concrete values for the program variables available if 
necessary. This notably gives a direct access to the concrete SQL code that can be, in 
some programs, crafted dynamically as a character string, to be parsed and processed by 
the DBMS. As a consequence, the tool from |53| can account, at least to some extent, for 
such dynamically crafted SQL statements, where our tool only considers SQL statements 
whose structure is completely defined statically. 

In [ 21 ], the JCUTE symbolic execution tool for Java is provided with the ability to 
generate database inputs in the presence of simple SELECT queries in the tested code. The 
principle is to consider the result of any query as an independent array-like input for the 
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program. The size of each of these input arrays, as well as the content of each of their cells 
can then be accessed in the normal code. As a consequence, the path constraint generated 
in JCUTE contains constraints over these sizes and contents. Additional constraints 
are added to the path constraint, enforcing that each row in an input array makes the 
condition of the WHERE clause from its corresponding SELECT true. Constraints are 
solved using an ad-hoc solver for strings. Experimentation is discussed for a 16 lines of 
code database program written in Java involving one query extracted from open-source 
software. 

This approach exhibits some conceptual limitations compared to relational symbolic 
execution. It indeed provides no mean to handle INSERT, UPDATE and DELETE 
statements, neither to account for transaction management primitives, which are crucial 
components of database programs. Moreover, [24] only provides a conceptual intuition 
and no proper mechanism for enforcing the data integrity constraints defined in the 
database schema. (53] advocates that such a lack leads to the possible generation of 
invalid test data and to a poor code coverage. 

Nevertheless, the tool detailed in [24] has also important features that are not present 
in our tool. First, and contrary to our tool, it deals with null values and, at least partly, 
with character strings. Secondly, it relies on the JCute constraint-based testing tool. 
JCute handles a large part of the Java language, including multi-threading, and uses a 
dynamic path exploration process. As with Pex, this makes the tool from [21] able to 
benefit from concretisation as well as to account, at least to some extent, for dynamically 
crafted SQL statements. 

On a related but complementary level, a substantial amount of work (e.g. mss 

has been done on how to generate test 
database content exhibiting some desirable properties, given only the database schema 
and possibly some queries to be executed over the database. The main difference between 
our work and these approaches is that they essentially work without considering the 
control flow of the programs manipulating the database. 

Microsoft Qex [531 [3D] is probably the one of these techniques which is the closest 
to our work, as both approaches are based on the translation of the SQL semantics 
into SMT-Lib constraints, solved using Z3. However, the two approaches also exhibit 
important differences. Firstly, Qex only considers input generation for a single SQL 
SELECT query in isolation, but the handled queries can be more complex than in our tool. 
Secondly, the two techniques translate the SQL semantics into constraints differently: our 
tool use predicates and quantifiers to represent relations, while Qex use fixed-size lists of 
tuples. Qex involves thus a mechanism similar to Alloy, where the solving process must 
be repeated on increasing size for the input relations, in order to find a solution. Such an 
approach does not enable proving the constraints unsatisfiable. 

Some database programs are developed to work with data already stored in an existing 
database. Some papers [571 [55] study the particular situation where classical test inputs 
must be generated for code manipulating in parallel an existing database with a known 
input content. 

Other work [5D1ISD1ISI] considers mutation testing of database programs, where our 
approach performs structural testing. In mutation testing, the quality of the test data 
is no more measured in terms of code coverage, as in structural testing, but in terms of 
program fault detecting ability (see [3] for a discussion). Some works have also focused 
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over testing of non-functional aspects of database programs, like security testing [5^ . 

4-3. Threats to validity 
4-3.1. Internal validity 

The performance comparison between our relational symbolic execution tool and the 
SQL normalisation tool from [23j revealed a very strong advantage for relational symbolic 
execution, over similar hardware configurations. However, the comparison was performed 
on a limited number of small samples and considering tools with different architectures. 
We think that the impact of the last element is minor, given the small size of the tested 
methods, and cannot explain, on its own, how important the measured performance gap 
is. However, this element, as well as the lack of a comparison over a large number of 
various complex pieces of code, threaten our ability to conclude, from our experimental 
work, that a direct translation of SQL into relational constraints intrinsically improves the 
testing time, compared to compiling SQL into native code before generating constraints. 

4-3.2. External validity 

The tool we have developed handles a limited subset of the Java/SQL syntax and 
was developed in the context of transactional business use cases, i.e. small pieces of 
code acting on a limited number of rows in the database. Several elements threaten the 
generalisability of our approach to handle any piece of code involving SQL. 

Conceptually, our approach can be integrated into any existing symbolic execution tool, 
to provide this tool with the ability to handle SQL code. An integration into a state-of-art 
tool, able to handle not all, but a wide range of programs in a mainstream programming 
language, would strongly improve the practical scope of our technique. However, such an 
integration needs to be evaluated in practice. 

The handling of a larger part of SQL may require the generation of constraints which 
could be hard to manage by current solver technology. Moreover, as SQL is a large and 
complex language, fragmented into several dialects, developing an universal SQL symbolic 
execution engine, able to handle any piece of SQL code, will be a very hard task. 

Our tool provides no automated way to handle SQL code dynamically crafted by 
the program, which can be frequent in practice. Alternate tools have shown that we 
could exploit the concretisation mechanism from a state-of-art symbolic execution tool to 
alleviate the problem, thanks to the availability of the concrete values for the dynamically 
crafted SQL code. However, such an approach fails handling cases where the syntactic 
structure of the crafted SQL depends on the inputs of the tested code. 

Finally, current solver technologies have shown to face a scalability issue when required 
to solve the constraints generated by our tool for a sufficiently large number of SQL 
statements. Whether this problem can be solved or at least alleviated, by optimising the 
constraint generation or by new solving capabilities, remains an open research question. 


36 


4-.4- Future work 

For further work, we have identified the following three main research directions: 

• Our tool has demonstrated how a classical symbolic execution mechanism for a typical 
programming language could be empowered with the ability to generate constraints 
in the presence of SQL code. As this integration provided promising results, it should 
now be repeated with a state-of-art tool based on symbolic execution. Microsoft 
Pex |18j is a particularly appropriate candidate for integration with our work, as it 
is based as well on the Z3 solver. In addition, such an integration could enable a 
deeper comparison with the tool from [23], which is also based on Pex. The Pex tool 
is not open-source but provides an extension interface. To the best of our knowledge, 
available open-source constraint-based testing tools are CREST |B5] and KLEE m 
(for C code) as well as Symbolic Path hinder [T^ (for Java byte-code). 

• By integrating our approach in a tool like Pex, the resulting tool would of course 
keep its advanced features, like a large syntax coverage for a modern programming 
language, as well as a dynamic path exploration process coupled with heuristics 
to handle large number of paths. In presence of SQL code built dynamically as a 
character string in the tested code, the use of a dynamic path exploration would 
make the concrete values of the assembled string elements directly available. This 
runtime information should be used to recover the complete structure of the executed 
SQL code, making us able to translate it into constraints, as described in this paper. 

However, such an approach will fail if the program’s inputs are used as parts of 
the syntactic structure of the dynamically-crafted SQL code. Nevertheless, by 
choosing appropriate concrete values for those parts of the inputs that are used to 
define the syntactic structure of the dynamic SQL statements, one could produce 
representative specialised versions of the original program that could be properly 
evaluated symbolically. Interleaving symbolic execution with such a partial evaluation 
|M] has already been studied in another context by |5S|. Detecting which parts of 
the programs’s inputs should be made concrete could benefit from existing work (e.g. 
[MIISZ]) over detection of SQL injections attacks. 

• In the perspective of testing programs involving many complex SQL statements, a 
tighter integration with constraint solving techniques would be beneficial to offer 
a better scalability and a larger scope to the approach. The constraint generator 
should be tailored so to generate constraints optimised for the internal algorithms of 
the solver. Conversely, the development of solving algorithms or heuristics tailored to 
the kind of constraints produced by the symbolic execution of large pieces of complex 
SQL code should also be considered. 

In particular, SQL enables various operations to be performed over data belonging 
to various datatypes, such as strings, binary objects, numeric values and timestamps. 
Symbolic execution of such operations will produce complex constraints over such 
datatypes. If modern SMT solvers like Z3 can already handle many of these con¬ 
straints, work should be done to locate the common parts of SQL which will put 
current solvers into trouble and research should be performed to possibly build a 
workaround. Solver development is a particularly dynamic research domain. Notably, 
research is ongoing (e.g. [b8L ItiHj l towards a proper solving of string contraints 
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inter-related with other kinds of constraints, in the context of symbolic execution. 
Similarly, recent works (see e.g. izni) have considered (partial) solving of non-linear 
integer arithmetic constraints. Other works have also studied the particular problem 
of multi-granularity temporal constraint solving (see e.g. [71)1. 

However, building optimal constraint generation rules for the whole of SQL is made 
difficult by the fact that the syntax and semantics of SQL is large and complex, and 
can vary strongly in practice between different DBMS’s versions and manufacturers. 
A research direction for overcoming these difficulties would be to use a relational 
algebra as an intermediate language for constraint generation. The symbolic execution 
engine would compile the original SQL code into a minimal relational algebra, and 
then the algebraic operators would be translated into logical constraints, using a 
minimal set of translation rules optimised for the solver. Algorithms translating SQL 
statements into equivalent combinations of a core set of relational algebra’s operators 
are well-known, in the context of query processing in DBMS design [72] . In practice, 
this idea should be refined, as SQL allows non-relational constructs like rows ordering 
and aggregation, null values etc. The intermediate language should thus be extended 
by a minimal set of operators for describing common and tractable non-relational 
parts of SQL. Concretisation could be the last-ditch solution to handle exotic or too 
complex parts of SQL. 

These research directions sketch a path towards a broader evaluation of the approach 
over many various real pieces of code, in an industrial context. 

Programs to be tested which manipulate an existing database are common in practice. 
Whether and how our technique could select test data from an existing database, instead 
of generating them from scratch is an interesting matter of further research. 

Constraints typically admit many different solutions. However, our tool uses the 
arbitrary solution returned as first by the solver. A possible improvement could be to use 
an optimal solution according to some criterion, like for example, a minimal number of 
rows in the database. In order to do so, it has recently been announcecQ that Z3 could 
be provided with the ability to return the solution which optimises a given objective 
function. Evaluating and integrating this mechanism with our approach is a topic for 
future research. 

Finally, being somehow parametrised with respect to the paths that should be considered, 
our approach can be used with respect to any traditional code coverage criterion based on 
the notion of execution path |4]. Nevertheless, several works Ha [Til Eg [761 ISO] propose 
coverage criteria particularly tailored towards testing of database programs. Integrating 
such criteria into our constraint-based approach is a topic of ongoing research. 
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